Raison d’être

This book is prepared as part of the R4GC Community skill-enhancing and knowledge-gathering exercise.
It aims to consolidate the knowledge base gathered by the R4GC Community across its various portals and discussions.
It also serves to illustrate - as always with source code - one of the most powerful
features of R, which is
the collaborative, peer-reviewed development of data science codes and reports using R Markdown.
Contributors
R4GC is a collaborative effort of many people who have contributed to the development of the knowledgebase that is gathered in this book.
They are listed below.
Jonathan Dench, Joseph Stinziano, Henry Luan, Eric Littlewood, Philippe-Israel Morin, Tony Machado, Maxime Girouard, Martin Jean, Tim Roy, Mehrez Samaali, Dejan Pavlic, Utku Suleymanoglu, Alex Goncharov.
Additionally, much support has also been received from the wider international R community through the stackoverflow.org portal and knowledge-sharing events organized by RStudio, as well as from several other Government of Canada employees who remain anonymous.
Their help is greatly appreciated.
Key principles
Chatham House Rule
This book is prepared using the Chatham House rule.
The Chatham House Rule helps create a trusted environment to understand and resolve complex problems through dialog and timely open communication.
Its guiding spirit is: share the information you receive, but do not reveal the identity of who said it. Hence, no attributions are made and the identity of speakers
and participants is not disclosed.
It is based on the views and codes
contributed by community members as part of ongoing community events and interactions. Offered as
a means to facilitate the discussion, the document does not constitute
an analytical document, nor does it represent any formal position of any
organisation involved.
History
R4GC Community (formerly called “Use R!” GCCollab community)
was created in March 2021 to bring together the R users across the Government of Canada.
Here we gather and curate the knowledgebase related to the use of R within the Government of Canada.
Everyone is welcome to join, whether you are an advanced R user, just starting to learn it, or simply want to learn more about data science and how it is done.
The idea to create this group came after the GC Data2021 Conference Data Literacy Fest workshop on Data Engineering Challenges and Solutions: Demonstration of Shiny. The highest-voted question during the discussion there was: “How can I get more help for our members to enhance their knowledge, ‘spread the word’, and raise more awareness regarding this tool?” The creation of this community group is the answer to this question.
By November 2021, the R4GC GCCollab group had become one of the largest active data science practitioner groups in Canada, counting over 250 members. The weekly “Lunch and Learn Data Science with R” meetups organized by the R4GC Community have been attended by data practitioners from over twenty government departments, and have generated
hundreds of questions and answers, a dozen tutorials, multiple openly available applications, and thousands of lines of open code.
On October 29, 2021, the work of this group was presented at the 2021 International Methodology Symposium.
This book aims to consolidate all the knowledgebase gathered by the R4GC Community at its portals and meetups.
Lunch and Learn meetups
"Building advanced Data Science skills using R, together - one meeting at a time!"
These informal meetings are organized weekly during Friday lunch time (from 12:05 to 12:55).
There, data scientists wanting to upgrade their knowledge of R and other data science subjects get together to show and discuss their R codes and share their data coding tricks and methodologies.
Normally, each session is focused on a particular subject or project with the codes shared on GCcode.
No registration is required to join the meeting. However, in order to view the notes and video-recordings from these meetups, you need to join this sub-group: https://gccollab.ca/groups/about/7855030.
For Agenda and Dial-in MS Teams numbers please see Group Events page at https://gccollab.ca/event_calendar/group/7391537.
Book structure
The book is organized in several parts.
Part I is dedicated to General discussions, which includes the following:
1 Why R?
2 R and Python
3 Best way to learn R
4 Other resources and bookdown textbooks
5 Events and Forums for R users
Part II is dedicated to the Best Practices and Efficient Coding in R and includes the following:
6 From Excel to R
7 data.table: your best fRiend
8 Reading various kinds of data in R
9 Efficient programming in R
Part III is dedicated to Visualization and Reporting and includes the following:
10 R Markdown for literate programming and automated reports
11 ggplot2 and its extensions for data visualization
12 Shiny for Interactive Data Visualization, Analysis and Web App development
13 Using R with GC data infrastructure (gcdocs, AWS, etc)
14 Interactive Outputs in R: ‘plotly’, ‘Datatable’, ‘reactable’
Part IV is dedicated to advanced use of Data Science, Machine Learning, and AI. It includes:
15 Record Linking and other Data Engineering tasks in R
16 Geo/Spatial coding and visualization in R
17 Text Analysis in R
18 Machine Learning and Modeling in R
Part V contains the Tutorials developed by and for the community. These are:
19 GCCode 101
20 Packages 101
21 R101: Building COVID-19 Tracker App from scratch
22.1 Geo/Spatial coding and visualization with R. Part 1:
22.2 Text Analysis with R. Part 1:
and a number of short "How To" tutorials such as:
22.3 Dual Coding - Python and R unite !
22.4 Working with ggtables
22.5 Automate common look and feel of your ggplot graphs
and others
Part VI presents the outputs of the community development such as Shiny Web Apps and other codes and libraries developed by community members.
Finally, the Appendix includes the schedule and agendas for the R4GC community “Lunch and Learn” meetups, Release Notes, plans and an invitation for collaboration, in particular to make this knowledgebase bilingual - in support of bilingualism in Canada and as another opportunity to apply data science skills for public good.
About this book
This book is built using the bookdown R package in RStudio.
It is hosted at the Open Canada GitHub repo https://open-canada.github.io/r4gc.
Its source code is located at https://github.com/open-canada/r4gc.
Built this way, the book enables easy collaboration, transparency and peer review.
Additionally, as with any markdown file, beyond the output visible in the final compiled html file, it also allows one to gather and save discussion comments and draft ideas that remain visible inside the source Rmd file.
How to contribute
Any chapter of this book can be edited by simply clicking on the “edit” button, which leads to the corresponding source Rmd file in the book’s repo, where you can make a change to the document (in doing so, the repo will be forked to your GitHub account) and submit it to the book editor (by submitting a merge request).
Alternatively, you can always contact the R4GC group lead at the contact listed below and attend R4GC weekly meetups.
About authors
Dmitry Gorodnichy is a Research Data Scientist with the Chief Data Office at the Canada Border Services Agency.
Patrick Little is an Advisor on the Open Government systems team at the Treasury Board of Canada Secretariat.
General discussions
Why R?
R is one of the fastest-growing programming languages and environments for data science, visualization and processing.
R integrates well with Python and other free and commercial data science tools. It can also do things that other tools cannot, and is very well supported by a growing international community.
It also comes with RStudio - a free Integrated Development Environment (IDE) that is now supported by most GC agencies and departments, and that is becoming one of the main tools for data-related problems worldwide. For Microsoft data users: anything you do with Excel, Access or Power BI, you can also do with R and RStudio.
10 main reasons to use R for Data Science
There has been much discussion around this topic; below are our own top 10 reasons to use R for Data Science - filtered and refined by the members of this community.
- Advanced graphics with ggplot2 and its extensions
- Automated report/tutorials/textbooks generation with RMarkdown
- Streamlined package development with devtools
- Streamlined Interactive interfaces and dashboards development and deployment with Shiny
- “Best for geocomputation”*
- Common tidy design shared across packages
- Curated peer-tested repo of packages at CRAN
- RStudio IDE (Integrated Development Environment) on desktop and cloud (rstudio.cloud)
- Full support and inter-operability with Python from the same IDE
- Global RStudio**-led movement for R education and advancement (rstudio.com)
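As a quick taste of the first item above, a polished chart takes only a few lines of ggplot2; this sketch uses the mtcars dataset that ships with R:

```r
library(ggplot2)

# Scatter plot of fuel economy vs. weight with a linear trend line,
# using the built-in mtcars dataset
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       title = "Heavier cars use more fuel")
p
```

The same object `p` can be further themed, faceted, or extended by any of the many ggplot2 extension packages.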
R vs. Python
Main references:
The latter also has a nice summary of pros and cons of both languages, subjectively summarized by myself below:
Python’s PROs:
- Object-oriented language (this is a big one for me)
- General purpose, i.e. you can use it elsewhere (e.g. with a Raspberry Pi - www.raspberrypi.org, as I did for one of my daughters’ projects)
- Simple and easy to understand and learn (this could be subjective, but I would generally agree that R is not taught the way I personally would teach it, i.e. based on computer science principles rather than on memorizing a collection of various heuristic tools and functions.)
- Efficient (fast) packages for advanced machine learning activities, e.g. tensorflow or keras (which you can use from R too, but it could be less efficient), also a large collection of audio manipulation / recognition packages (some of which I played with for one of my biometrics projects)
Python CONs:
- Not really designed for advanced manipulation or visualization of data
R’s PRO’s:
- It is designed specifically for data tasks
- CRAN provides >10K peer-tested (!) packages for any data task you can think of. (I would also add: it also provides great tools to get your OWN work tested and featured one day on CRAN - which is what we are doing now at our Learn R meetups!)
- ggplot2 graphics is unbeatable
- Capable of standalone analyses with built-in packages. (Does anyone know what this means??)
R’s CON’s:
- Not the fastest or most memory-efficient (Hey - that’s without the
data.table package! I don’t think so, if you use data.table properly)
- Object-oriented programming is not native or as easy in R, compared to Python. - Actually, I added this one. To me, this is one of the main motivations to use both Python and R, instead of R by itself. But for someone else, this might not be a tie-breaker.
Other resources
Best way to learn R
The number of resources and ways to learn R is enormous.
Some of us have tried many of them before finding the ones that we believe are the best.
Here are some: https://ivi-m.github.io/R-Ottawa/resources.html
And of course, don’t ever be shy to ask questions (or search the many existing answers) at https://stackoverflow.com/.
In fact, what a great option for you to save all the knowledgebase you acquire!
Helping yourself, you also help others, and contribute to further improvement of many R packages!
Great bookdown textbooks
This book is prepared using the bookdown R package.
It is inspired by many other
open source codes and examples.
Below are listed those of particular value in developing this book.
See also the textbook references given for each of the community discussion topics later in the book.
Events and Forums for R users
Other R communities in GC
Roughly sorted by the level of group activity
Other groups
R conferences
(from https://rviews.rstudio.com/2021/03/03/2021-r-conferences)
Cascadia RConf 2021 (June 4 - 5), a jewel of a regional R conference for its first three years, was canceled in 2020. It is back this year as a virtual event. The Call for Presentations is open.
useR! 2021 (July 5 - 9) has an outstanding lineup of keynote speakers. The program is very likely to make US-based attendees night-owls.
EARL Conference 2021 (September 6 - 10), the premier R in industry event, will be online this year. The call for abstracts is already open.
IEEE Conferences:
IEEE BigData
IEEE International Conference on Technologies for Homeland Security HST
Art of efficient R coding
data.table: your best fRiend
The data.table package developed by Matt Dowle is a game changer for many data scientists.
Learn about it, and share your favourite data.table trick, here:
https://github.com/Rdatatable/data.table
https://github.com/chuvanan/rdatatable-cookbook
http://r-datatable.com (https://rdatatable.gitlab.io/data.table/)
https://www.datacamp.com/courses/time-series-with-datatable-in-r
https://www.datacamp.com/courses/data-manipulation-in-r-with-datatable
https://github.com/Rdatatable/data.table/wiki/Articles
https://rpubs.com/josemz/SDbf - Making .SD your best friend
data.table vs. dplyr
data.table (Computer language) way vs. dplyr (“English language”) way
- The best: no wasted computations, no new memory allocations.
dtLocations %>% .[WLOC == 4313, WLOC:=4312]
- No new memory allocations, but computations are done on ALL rows.
dtLocations %>% .[, WLOC:=ifelse(WLOC==4313, 4312, WLOC)]
- The worst: computations are done on ALL rows. Furthermore, the entire data is copied from one memory location to another. (Imagine if your data has 1 million cells, of which only 10 need to be changed!)
dtLocations <- dtLocations %>% mutate(WLOC=ifelse(WLOC==4313, 4312, WLOC))
NB: dtLocations %>% .[] is the same as dtLocations[], so you can use it in pipes.
Conclusion: use data.table for speed and efficient coding instead of dplyr (i.e. tibbles)!
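To see the "best" pattern concretely, here is a small self-contained sketch; the dtLocations values are made up for illustration:

```r
library(data.table)

# Toy version of the dtLocations table used in the examples above
dtLocations <- data.table(id = 1:5, WLOC = c(4313, 4310, 4313, 4311, 4312))
addr_before <- address(dtLocations)

# Subset-assign by reference: only the matching rows are computed on,
# and the table is modified in place - no copy of the data is made
dtLocations[WLOC == 4313, WLOC := 4312]

stopifnot(address(dtLocations) == addr_before)  # still the same object
dtLocations[]
```

The `address()` check is what distinguishes this from the dplyr `mutate()` version, which allocates a new table every time.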
Extensions of data.table
There’s considerable effort to marry the data.table package with the dplyr package. Here are some notable ones:
Reading various kinds of data in R
vroom
My favourite methods for reading / writing “regular” .csv files have been data.table::fread() / fwrite() - the fastest and automated in many ways. Now there’s another option - the vroom package: https://cran.r-project.org/web/packages/vroom/vignettes/benchmarks.html
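A minimal round trip with fwrite()/fread() looks like this (vroom::vroom() can be dropped in as the reader instead):

```r
library(data.table)

# Write and re-read a CSV; fread() auto-detects the separator,
# header and column types (and reads in parallel on large files)
dt  <- data.table(x = 1:3, y = c("a", "b", "c"))
csv <- tempfile(fileext = ".csv")
fwrite(dt, csv)

dt2 <- fread(csv)          # or: vroom::vroom(csv)
stopifnot(identical(dt$x, dt2$x), identical(dt$y, dt2$y))
```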
Then, of course, there are other kinds of data you want to read - efficiently (meaning, automatically and fast):
bad data, badly formatted data, sparse data,
distributed “big” data
just very large and very very large
from MS Excel, MS Word
from clouds: AWS, MS Azure etc
from pdf, html
from zip files
from google docs, google sheets
from GCdocs and from other GC platforms (that was one of the questions at our Friday R meetup), and,
finally, from all other IoT and web-crawling sources
readxl and xlsx
For reading Excel files, so far I have used readxl. I would nevertheless like to be able to import a set of non-contiguous columns from a sheet (something that is possible to select in newer versions of Excel using data queries).
For writing Excel files, I have used xlsx, as I definitely need to be able to write multiple sheets to one file.
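readxl only reads contiguous ranges, but non-contiguous columns can be picked up as two ranges and combined. A sketch, with the caveat that the demo workbook here is created with writexl purely for illustration - any existing .xlsx file would do:

```r
library(readxl)
library(writexl)  # only used to create a throwaway demo file

# Demo workbook with four columns a..d
xlsx <- tempfile(fileext = ".xlsx")
write_xlsx(data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12), xlsx)

# Read non-contiguous columns A and C as two contiguous ranges, then combine
left  <- read_excel(xlsx, range = cell_cols("A:A"))
right <- read_excel(xlsx, range = cell_cols("C:C"))
cbind(left, right)
```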
Discussion
The analyst should never stick to one solution but rather adapt to the needs of the project. For every format that is good and efficient with big data, you either 1) incur manipulation overhead that does not make sense when working with small datasets - such formats can end up slower than even data frames on small data, yet hundreds of times faster on big data (e.g. feather) - or 2) need to wait forever and lose storage space for nothing (parquet) if the data is not big enough. Yet, if you find the right solution for every size and need, it will make a world of difference.
…
The example below compares some popular formats when used with pandas (Python). You will get similar results if you try the same experiment in R. https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d
…
One of the options that I recommend, if you are only playing locally and not in the cloud, is using the feather format with SQL.
If you need to extract data from a database and do more advanced data engineering without loading data in your RAM, you need SQL to prepare the extraction and do basic to advanced manipulation (SQL is Turing-complete, eh).
For more advanced and permanent transformations to the data, you need stored procedures (SQL again).
And if you play in the cloud, this is even more essential. For example, in AWS, you can create user-defined functions in Redshift using Python and Postgres SQL, but not R. All manipulation needs to be done in SQL, and Python can be used for other purposes such as calculations and fuzzy matching.
You can still use R in the rebranded Jupyter notebooks (Sagemaker in AWS, Azure Notebooks in Azure), but R is not as widely compatible in other cloud applications as SQL and Python. - [ PD: But you can absolutely use R in AWS for ETL. In fact you could even set up API endpoints via plumber; there’s a whole AWS tutorial that deals with this issue]
References:
https://github.com/pingles/redshift-r/
Provides a few functions to make it easier to access Amazon’s Redshift service from R.
http://www.rforge.net/RJDBC/index.html
install.packages(“RJDBC”,dep=TRUE)
RJDBC is a package implementing DBI in R on the basis of JDBC. This allows the use of any DBMS in R through the JDBC interface. The only requirement is working Java and a JDBC driver for the database engine to be accessed.
feather (for files larger than a gigabyte):
https://blog.rstudio.com/2016/03/29/feather/
parquet (for very, very large files)
https://campus.datacamp.com/courses/introduction-to-spark-with-sparklyr-in-r/case-study-learning-to-be-a-machine-running-machine-learning-models-on-spark?ex=4
Conclusions: As a side note on size, speed, and performance: it all depends on what you do, the delays, and the cost.
For example, if you use the cloud:
If your data is going to be queried very often, so that large volumes of data would be scanned, move your processing to a runtime-billed tool (e.g. Redshift in AWS) rather than a data-billed tool (e.g. Athena in AWS). Otherwise, your cost may increase exponentially if users can survey data freely from, say, Tableau dashboards without caring about the actual amount of data that is queried. With runtime billing, even if the data is queried 24 hours a day, your cost is stable and won’t increase with volume.
If you may scan large volumes once or twice a day, then you would have to compare the costing options.
If the costing model is incremental by runtime and you have very large amounts of data that you need to query quickly, then it would be best to use columnar formatted tables such as parquet. There is a cost and delay involved for the conversion, and you need much more storage because of the flattened structure, so storage will be more expensive (especially considering that you clone your original data and use at least twice the space then). However, queries will fly, and the cost of computation will be much smaller thereafter.
For occasional queries, a data-billed tool would likely be the best option.
If you want to prototype with small datasets, do not lose time with parquet… CSV is the worst format after Excel files (which need to be unpacked and repacked) in any scenario, but at this scale the time investment to convert the data is not worth it at all. data.table and DT will be your best friends in R.
As for using SQL vs. packages such as dplyr, I mentioned a gain in performance, but be careful. If you use raw SQL, then you will see a big gain in performance. However, there are packages out there that translate SQL to R or Python interpretable code, and those will possibly be slower due to the interpretation layer. dplyr, on the other hand, is quite efficient and well optimized. As usual, it depends on the packages. In R, the sqldf package should be good, if you want to try it out.
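For example, sqldf runs SQL directly against in-memory data frames (via an embedded SQLite engine), which makes it easy to compare the two styles side by side:

```r
library(sqldf)

# The same aggregation, written once in SQL and once in base R
res_sql <- sqldf("SELECT cyl, AVG(mpg) AS avg_mpg
                  FROM mtcars GROUP BY cyl ORDER BY cyl")
res_r   <- aggregate(mpg ~ cyl, mtcars, mean)

stopifnot(isTRUE(all.equal(res_sql$avg_mpg, res_r$mpg)))
res_sql
```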
Efficient programming in R
TBA
RStudio tricks
Running multiple RStudio versions
You want to be able to run multiple versions of RStudio in Windows? You can, with the following executable .bat script.
REM Run-RStudio-1.4.bat
@echo off
title Starting RStudio!
echo Hello Dear,
echo Starting RStudio 1.4 (from C:\Users\abc123\Downloads\RStudio-1.4.1106\bin\) for you...
start "" "C:\Users\abc123\Downloads\RStudio-1.4.1106\bin\rstudio.exe"
Visualization and Reporting
R Markdown for literate programming and automated reports
This discussion thread is for gathering knowledgebase related to R Markdown: https://rmarkdown.rstudio.com/
It can be used to generate reports, slides, websites, dashboards, Shiny apps, books and emails.
It is the tool that allows you to do literate programming, which - as defined by Donald Knuth - is the type of programming where human language is combined with computer language, making the code much easier to understand for your colleagues and yourself, and coding much more fun.
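A minimal R Markdown source file looks like this; when knit, the chunk is executed and its output is woven into the report (the title and chunk content here are illustrative):

````markdown
---
title: "A minimal report"
output: html_document
---

The summary below is computed when the document is knit, so the
numbers in the report always match the code that produced them.

```{r speed-summary}
summary(cars$speed)
```
````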
ggplot2 and its extensions for data visualization
Many come to R (from Python and other languages/ systems) mainly because of the advanced data visualization capabilities it offers. There are many of those, as are the resources. Share your recommendations and examples here.
Shiny for Interactive Data Visualization, Analysis and Web App development
This discussion thread is dedicated to the Shiny package - an RStudio-curated tool for developing and deploying interactive data visualization and analysis tools and applications.
Using R with GC data infrastructure (gcdocs, AWS, etc)
gcdocs
Q:
I’m wondering if anyone has had any success accessing data from GCDocs in your respective departments? I believe that GCDocs is implemented across most (if not all) departments, so I am wondering if there are any existing solutions to read/write data from it.
Also wondering about whether any of you have had any luck accessing Microsoft 365 via R as well? I’ve had success with Microsoft365R package (https://github.com/Azure/Microsoft365R) from a personal point of view but it doesn’t play well (at least not in my department - ISED) with business accounts.
A:
I tried using the Microsoft365R package to access my departmental (ECCC) email without success. When I tried to access it, a window popped up allowing me to request access authorization so I clicked the “Submit” (or whatever) button. That was weeks ago. Never heard anything more about it.
After trying many ways over many weeks, I found this is NOT possible. GCDocs runs internal code on its end that validates that you have the right to access a document and then, if you do, logs your action within the document’s “Audit” attribute.
So we still have to always make a local copy of the data (manually!), and only then can we process it from R.
Interactive Outputs in R: plotly, Datatable, reactable
This discussion is dedicated to tools to generate interactive graphs, tables, and other content without Shiny (which, as you know, requires a server to host your Shiny application and which for this reason cannot be easily shared with your clients, e.g. by email)
Here are the most popular ones:
https://rstudio.github.io/DT/
https://plotly.com/r/ and
https://glin.github.io/reactable
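For instance, DT wraps the DataTables JavaScript library in a single call; the resulting widget can be embedded in an R Markdown page, or written out with htmlwidgets::saveWidget() as one .html file that you can email to a client:

```r
library(DT)

# An interactive (sortable, searchable, paginated) table of mtcars;
# no Shiny server is needed to view or share it
tbl <- datatable(mtcars, filter = "top", options = list(pageLength = 5))

# htmlwidgets::saveWidget(tbl, "mtcars.html")  # writes a shareable .html file
tbl
```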
Machine Learning and AI
Record Linking and other Data Engineering tasks in R
demo
See http://rcanada.shinyapps.io/demo and the #GCData2021 Data Engineering workshop presentation
for the backgrounder and a demonstration of various data engineering tasks and solutions.
Geo/Spatial coding and visualization in R
Resources
There’s much effort across many GC departments to analyze and visualize geo-data. This discussion is the place to share your results, ideas or problems related to this topic.
Below is a great resource to start, which also provides a nice explanation on why R is believed to be the best language to do this kind of work.
Geocomputation with R, a book on geographic data analysis, visualization and modeling.
The online version of the book is hosted at https://geocompr.robinlovelace.net and kept up-to-date by GitHub Actions
Dealing with memory issues
A blog post that illustrates a few ways to avoid overloading R’s memory when working with large spatial objects (here’s looking at you, 30-m land cover map of North America!).
https://www.ecologi.st/post/big-spatial-data/
The two other posts on that blog also have some really nice tips for general R coding.
Canadian geo-data
Useful code and R packages from public domain to work with Canadian geo-data.
From https://mountainmath.ca
- https://github.com/mountainMath/mountainmathHelpers
- tongfen: Convenience functions for making data on different geometries, especially Canadian census geometries, comparable.
- cancensus : R wrapper for calling CensusMapper APIs
- cansim: Wrapper to access CANSIM data
- CanCovidData: Collection of data import and processing functions focused on Canadian data
Text Analysis in R
https://gccollab.ca/discussion/view/7404441/text-analysis-in-r
Plagiarism detection
Q: Any ideas/packages/resources (in R) for plagiarism detection?
A:
A good place to start is the “stylo” package (https://github.com/computationalstylistics/stylo - R package for stylometric analyses) which implements a wide variety of recent research in computational stylistics. Plagiarism detection is fraught (insert all of the usual ethical and computational caveats…), but stylo can help you identify passages that are stylistically unusual compared to the rest of the text. Unusualness definitely isn’t a proxy for plagiarism, but it’s a good place to start.
Q:
Is this focused on English language text? Are there lexicons or libraries for comparison within other languages (e.g., French)?
A: Stylo works well with quite a few non-English languages. French, for example, is supported, as are a number of languages with non-Latin alphabets like Arabic and Korean.
Results from the International Methodology Symposium
Two presentations at the International Methodology Symposium were about Text Analysis with R, with great ideas in both:
11B-4 by Andrew Stelmach from StatCan: used library(fastText) - a very powerful package from the Facebook AI team for efficient learning of word representations and sentence classification.
11B-2 by Dave Campbell from Carleton U: used the approach that we discussed at the L&L on October 9 (based on bag-of-words cosine distance / correlation) for matching beer product descriptions - https://gccode.ssc-spc.gc.ca/r4gc/resources/text/), but additionally applied SVD (singular value decomposition) to restrict the comparison to the most important words, thus significantly reducing the dimensionality and computation time.
You can find their decks here: https://drive.google.com/drive/folders/1TfuNmG3V8IEKDNNTcMZz7_YCKgqVVBju
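A minimal base-R sketch of the bag-of-words cosine similarity idea discussed above (the product descriptions are made up):

```r
# Cosine similarity between two texts using raw word counts
cosine_sim <- function(a, b) {
  ta <- table(strsplit(tolower(a), "\\s+")[[1]])
  tb <- table(strsplit(tolower(b), "\\s+")[[1]])
  vocab <- union(names(ta), names(tb))        # shared vocabulary
  va <- as.numeric(ta[vocab]); va[is.na(va)] <- 0
  vb <- as.numeric(tb[vocab]); vb[is.na(vb)] <- 0
  sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))
}

cosine_sim("pale ale 341 ml", "pale ale 473 ml")  # similar descriptions
cosine_sim("pale ale 341 ml", "red wine 750 ml")  # dissimilar descriptions
```

In the approaches above, SVD would then be applied to the document-term matrix built from many such descriptions, to compress it before comparison.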
Machine Learning and Modeling in R
GCCode 101
title: “gccode101: working with GCCode”
subtitle: (With focus on how to do it using RStudio)
date: September 2020
The information presented here includes only open public-domain knowledge and does not include specific details related to the operation of each GC department. The full tutorial and related Q&A are available at https://gccode.ssc-spc.gc.ca/r4gc/resources/gccode101
TL;DR
[One-time action, once per lifetime] Make sure you have RStudio and git installed on your local machine, e.g. from Anaconda - ask your IT to help you.
[One-time action per project] Go to GCCode and create a new repo there (it can be left empty, or just add a README.md) or select the existing one you want to work on: e.g. https://gccode.ssc-spc.gc.ca/r4gc/gc-packages/packages101
Generate an Access Token (from Settings in the left panel) or request one for a repo of which you are not the owner. It will look something like this: LNwVUF5YGnF-6x5fsnJ-
Open Windows PowerShell (or cmd), go to the directory where you want to clone your repo (e.g. cd C:_CODES_packages) and run this command: git clone --progress https://oauth2:LNwVUF5YGnF-6x5fsnJ-@gccode.ssc-spc.gc.ca/r4gc/gc-packages/packages101 gc-packages101. Close it - you won’t need it again! You can check: your new directory contains a .git/ folder! (This is where your credentials for GCCode are stored.)
Open RStudio and create a New Project there. You have two options. In both cases, you’ll see the GIT button on top once you finish and reload your project with RStudio, which means you are all set and can start modifying / building your code!
Option A (to build a new R package): choose New Project -> New Directory -> New Project (with the name of your package, e.g. caPSES). Leave the “Create a git repository” box UNCHECKED. Once it is created, MOVE the entire content of this new package folder to the directory that you cloned in the previous step (which contains the .git/ folder), or vice versa, just move the .git/ folder from the cloned directory to your new package directory.
Option B (for any other project): choose New Project -> Existing Directory -> point to your newly cloned directory. That’s it.
- [Every time you make changes to the project]
- We recommend to always pull first (to avoid conflicts later) - from the Git menu button
- Make changes, open the commit window, describe them, commit them, push them back to the repo
- Enjoy the rest of your day!
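The everyday cycle above can be sketched as shell commands. The throwaway repository and file name below are illustrative; in practice you run these (or the equivalent Git-menu buttons in RStudio) in your cloned GCCode project, starting with a pull and ending with a push:

```shell
# Throwaway repository standing in for your cloned GCCode project
repo=$(mktemp -d) && cd "$repo" && git init -q .
git config user.email "you@example.gc.ca" && git config user.name "Your Name"

# (in a real clone you would start with: git pull)
echo 'x <- 1' > analysis.R            # ...make your changes in RStudio...
git add -A                            # stage the changes
git commit -q -m "Describe what you changed"
git log --oneline                     # the commit is ready for: git push
```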
More details from our fRiday’s Lunch and Learn discussions follow below.
Step 00: Connecting to GCCode and installing required soft.
“Connecting” means being able to pull a repository to your local (GC-network-connected) machine, modify code there (RStudio is the easiest way), commit your changes, and push them back to GCCode.
Before doing that, you need the following programs installed on your local (GC-network-connected) machine.
Below, (*) indicates options that have been tested as most efficient.
R:
- from Anaconda
- from CRAN*: R version 4.0.2 (2020-06-22): As of Sept 2020, we have the right to install packages directly from CRAN. So you can do:
library(installr); updateR()
RStudio:
- From Anaconda Prompt in new environment*:
conda create -n e2020.03.02-markdown_issues mro-base rstudio (replace e2020.03.02-markdown_issues with YOUR_NEW_ENVIRONMENT_NAME)
(This will create a shortcut (R) on desktop). - old: Version 1.1.456 – © 2009-2018 RStudio, Inc.
Git:
- From Anaconda Prompt*:
conda install git (or you can install in new environment e2020.03.02-markdown_issues)
- From Anaconda Prompt:
conda install m2-git (I installed in new environment e2020.07.21_mintty)
- Q: where is git.exe actually located in c: drive ?
cd \ .
- A: C:.21_mintty
- Check with IT - from web: https://git-scm.com/download/win (Git for Windows Portable (“thumbdrive edition”) -
64-bit Git for Windows Portable. NB: it’s not accessible from CBSA network)
Command windows (where you can run git commands):
- Windows
cmd (it may or may not work there for your machine)
- Anaconda prompt (this does not take all shell commands)
mintty* ( from www.msys2.org), install: conda install m2-base (creates ~/anaconda3/Library/mingw64 directory, git.exe installed with m2-git will be placed in /bin there).
- You can then run it from conda terminal:
mintty
- Run
which git (linux style, from mintty) or where git (windows style, from mintty) to find out the path to git.exe
- Note how Linux-like file system is mapped to Windows’ one below:
- $ where git: C:.exe
- $ which git: /usr/bin/git
Windows PowerShell: works well but does not take all linux commands (e.g. it does not know which/where, ls -a)
RStudio Terminal **: it runs git from the active environment from which you started RStudio. This allows you to use different git.exe settings or executables.
There are additional recommended packages for efficient source control and collaborative code development, which however can be learnt later, such as :
Step 0: Configuring Windows, Git and GitLab (tokens)
1: Edit the environment variables for your account (search “env”) and set HOME to /c/users/gxd006/ (replace gxd006 with your user id). This will become your home directory ~ for mintty, and this is where your .gitconfig file will reside for the git used in mintty! - needed for the next step below
- Test it:
$ echo $HOME: /c/Users/gxd006/
2: Edit the .gitconfig file as follows (note: it may be invisible to your OS; the easiest way to open/edit it is using RStudio).
Alternatively, you can view/edit it using the built-in vim editor: vim .gitconfig, or by running git config --global -e.
- Three main vim commands:
- ESC, a (insert text after cursor); ESC, :wq (save and exit); ESC, :q! (don't save and exit)
NB: if you run the terminal from RStudio (rather than from mintty), then git config --global -e will open the .gitconfig file of the git found via where git
[user]
name = Your Name
email = your.email@cbsa-asfc.gc.ca
[http]
proxy = "http://proxy.omega.dce-eir.net:8080"
NB: if you do not do that, you'll be getting this error:
fatal: unable to access 'https://gccode.ssc-spc.gc.ca/super-koalas/shared-code/': Failed to connect to gccode.ssc-spc.gc.ca port 443: Timed out
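If you prefer not to edit .gitconfig by hand, the same settings can be written from the command line. A minimal sketch (the name and email are the placeholders from the example above; the proxy URL is the one shown above):

```shell
# Write the same [user] and [http] settings without opening an editor.
git config --global user.name  "Your Name"
git config --global user.email "your.email@cbsa-asfc.gc.ca"
git config --global http.proxy "http://proxy.omega.dce-eir.net:8080"

# Verify what was written:
git config --global --get http.proxy
```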
3: Decide how/where you will organize your GCCode projects on your machine. This is how I converged (after many iterations) on organizing my projects - see file_structure.md
4: If you don't want to be typing your login id/password every time you connect to GCCode (which I'm sure you don't :), read this article: https://knasmueller.net/gitlab-authenticate-using-access-token, and create your new personal access token there (it will look something like tjxrg3GyUQJJDMaA6LfHA)
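If you have already cloned a repository with your login id/password, you do not need to re-clone it to switch to token authentication: git can rewrite the remote URL in place. A sketch, using the gccode101 project URL from Step 1 below, with YOUR_TOKEN as a placeholder for your personal access token:

```shell
# Run inside an existing clone: rewrite the origin URL to embed the token.
git remote set-url origin \
  "https://oauth2:YOUR_TOKEN@gccode.ssc-spc.gc.ca/r4gc/resources/gccode101"

# Confirm the new URL:
git remote get-url origin
```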
Step 1: Find (or create) a GitLab project you want to contribute to.
Let's say you want to contribute to this project: https://gccode.ssc-spc.gc.ca/r4gc/resources/gccode101
1.1: Go to ~/_CODES/GCCodes (this is where you keep all your GCCode projects),
and run from cmd terminal, or Anaconda prompt, or mintty:
CORRECTION:
Q: Currently this works from the conda or mintty terminal only. How to make git callable from Windows cmd? I changed PATH to add the directory C:\Users\gxd006\anaconda3\Library\bin, but it did not help
(base) C:\Users\gxd006>which git
/usr/bin/git
vs.
H:\>git
'git' is not recognized as an internal or external command,
operable program or batch file.
From Anaconda prompt, or mintty:
git clone --progress https://gccode.ssc-spc.gc.ca/r4gc/resources/gccode101 r4gc_gccode101
or better (if you set a personal token - see Step 0.4 above)
git clone --progress https://oauth2:tjxrg3GyUQJJDMaA6LfHA@gccode.ssc-spc.gc.ca/r4gc/resources/gccode101 r4gc_gccode101
Now you can go to the created directory r4gc_gccode101 and do something there, either from command line or directly from RStudio.
1.2 Using Command Line
Push a new file
cd existing_folder
touch README.md
git add README.md
git commit -m "add README"
git push -u origin master
Push an existing folder
cd existing_folder
git init
git remote add origin https://gccode.ssc-spc.gc.ca/r4gc/resources/gccode101
git add .
git commit -m "Initial commit"
git push -u origin master
Push an existing Git repository
cd existing_repo
git remote rename origin old-origin
git remote add origin https://gccode.ssc-spc.gc.ca/r4gc/resources/gccode101
git push -u origin --all
git push -u origin --tags
1.3 In RStudio
Recommended - see below
Step 2: Using Branches (optional)
It is recommended that you create your own branch for every project to which you want to contribute (e.g. I made the branch ivi for myself) and do everything there.
NB: you can also do it from RStudio or command line.
git fetch
git checkout ivi
git checkout master
git init .
touch some.txt
git add some.txt
git commit
git log
git status
git push
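The commands above assume the branch already exists on the remote. Creating your own branch (ivi is the branch name used in the text) and publishing it can be sketched as:

```shell
git checkout -b ivi      # create the branch and switch to it
git push -u origin ivi   # publish it and set it as the upstream

# Switch back and forth as needed:
git checkout master
git checkout ivi
```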
- For the rest of the presentation, we focus on using RStudio to do everything you need with GitLab in GCCode
Step 3: GCcoding from RStudio
You'll note that the Git button is now visible there! (That's because RStudio knows that this directory was cloned from GitLab.)
Make some changes
Click on the Git button menu -> Commit
Check the file(s) you want to commit, describe your change, click Commit, click Push - Voila! Done.
Packages 101
title: “How to convert your functions to package(s)”
subtitle: “GC Lunch and Learn: R packages 101”
author: Dmitry Gorodnichy and Joseph Stinziano
gitlab: https://gccode.ssc-spc.gc.ca/r4gc/gc-packages/packages101
date: “March -June 2021”
Taken from: https://gccode.ssc-spc.gc.ca/r4gc/gc-packages/packages101/-/blob/master/packages101.Rmd
How to contribute
Anyone: Fork Packages101 repo - modify - commit changes - push - submit request to merge
Members of r4gc group:
A token is generated to allow you to push/pull Packages101 repo.
Use this line to push/pull from Packages101 repo:
git clone --progress https://oauth2:tjxrg3GyUQJJDMaA6LfHA@gccode.ssc-spc.gc.ca/r4gc/gc-packages/packages101 r4gc_packages101
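The fork-based workflow for non-members described above can be sketched as follows (YOUR_ID is a placeholder for your GCCode user id; the actual fork URL is whatever GitLab shows after you click Fork):

```shell
git clone https://gccode.ssc-spc.gc.ca/YOUR_ID/packages101
cd packages101
# ... modify files ...
git add -A
git commit -m "Describe your change"
git push
# Then submit the merge request from your fork in the GitLab web interface.
```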
* R script to start:
library(devtools)
devtools::create("rCanada") ## if you are in /my_package
devtools::create("../../rCanada") ## will it erase it, if I already have it?
> #or usethis::create_package("../r4gc/packages/rCanada2")
library(roxygen2)
## library(testthis) <- we don't use this! Don't confuse it with library(testthat)!
library(testthat) ## do we need it?
library(usethis)
## https://github.com/r-lib/usethis
usethis::use_testthat()
## use_testthat()
use_news_md()
use_test("iviDT")
x <- 1
y <- 2
use_data(x, y)
use_vignette(name="my_vignettes") #
use_vignette(name="data_linking")
use_package("data.table")
## > use_package("data.table")
## √ Adding 'data.table' to Imports field in DESCRIPTION
## * Refer to functions with `data.table::fun()`
use_package("magrittr")
use_package("data.table")
use_package("lubridate", type="Imports") ## what about the order?! lubridate must be after data.table!
use_package("stringr")
use_package("IVIM")
## To update .Rd files in ./man, run:
devtools::document()
## Warning: The existing 'NAMESPACE' file was not generated by roxygen2, and will not be overwritten.
## So delete it, and then it will be created
To be discussed
Q:
When I Clean and Rebuild:
????!
** testing if installed package can be loaded from final location
?????? C:/Users/gxd006/DOWNLO1/R-401.2 /etc/Renviron.site ?? May 12 19:25:09
Warning: replacing previous import ‘data.table::month’ by ‘lubridate::month’ when loading ‘IVIM’
* Setup
This is what you already have:
- "Original" folder, with no specific structure,
with R and Rmd files that contain:
a) functions that you want to be re-used by others (and by yourself many months later!).
- They are tested, and test codes are included in if(F) blocks, in a separate .Rmd, or, even better, in an interactive Shiny app (e.g. https://rCanada.shinyapp.io/covid)
- It is a good idea to have them in a form that can be sourced: source("caCovid.R"); source("iviBase.R")
b) functions and other codes that are not (yet) ready for re-use.
https://gccode.ssc-spc.gc.ca/r4gc/codes/tracking-covid-data
- caCovid.R
- iviBase.R
- … common.R, plot.R, etc
This is what you want to get:
- Package folder (or several folders).
- GCCode/r4gc/packages/rCanada
- GCCode/r4gc/packages/IVI
https://gccode.ssc-spc.gc.ca/gorodnichy/rCanada
Ways to do it:
You can create it in two ways:
From New Project
Either way, RStudio will initialize and launch your new project.
The second way also creates a .gitignore file.
- In RStudio -> New Project -> New Directory -> R Package -> package name: caPSES
NB: there are many templates. Choose the basic one ("R Package") and create it in the folder /my_packages
NB: If you have .R codes that are already source-able, attach them with the "Add" button (one at a time), or you can copy them into the /R folder later
note from Hadley: https://r-pkgs.org/workflows101.html
Call usethis::create_package("path/to/package/pkgname").
Or, In RStudio, do File > New Project > New Directory > R Package. This ultimately calls usethis::create_package(), so really there’s just one way.
Don’t use package.skeleton() to create a package. Because this function comes with R, you might be tempted to use it, but it creates a package that immediately throws errors with R CMD build.
From script
- Run from any R, Rmd window or R console the command below
NB: output from
> devtools::create("../r4gc/packages/rCanada")
> #or usethis::create_package("../r4gc/packages/rCanada2")
New project 'rCanada' is nested inside an existing project '../r4gc/packages/', which is rarely a good idea.
If this is unexpected, the here package has a function, `here::dr_here()` that reveals why '../r4gc/packages/' is regarded as a project.
Do you want to create anyway?
1: Yes
2: Absolutely not
3: Nope
Selection: 1
√ Creating '../r4gc/packages/rCanada/'
√ Setting active project to 'C:/Users/gxd006/Downloads/_CODES/GCCode/r4gc/packages/rCanada'
√ Creating 'R/'
√ Writing 'DESCRIPTION'
Package: rCanada
Title: What the Package Does (One Line, Title Case)
Version: 0.0.0.9000
Authors@R (parsed):
* First Last <first.last@example.com> [aut, cre] (YOUR-ORCID-ID)
Description: What the package does (one paragraph).
License: `use_mit_license()`, `use_gpl3_license()` or friends to
pick a license
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.1
√ Writing 'NAMESPACE'
√ Writing 'rCanada.Rproj'
√ Adding '^rCanada\\.Rproj$' to '.Rbuildignore'
√ Adding '.Rproj.user' to '.gitignore'
√ Adding '^\\.Rproj\\.user$' to '.Rbuildignore'
√ Opening 'C:/Users/gxd006/Downloads/_CODES/GCCode/r4gc/packages/rCanada/' in new RStudio session
√ Setting active project to '<no active project>'
* Overall Workflow
.. Copy needed codes from MY_CODES to R directory
dt.replaceAwithB <- function(dt, col, a, b) {
dt[get(col)==a, (col):=b];
}
- Insert the roxygen skeleton (from the magic wand menu) and add a description
NB: you may need to manually insert @import and @export
no need to Load All?
Always start with "Clean and Rebuild", then run Check.
* .Rbuildignore
^.*\.Rproj$
^\.Rproj\.user$
MY_CODES (incorrect)
^MY_CODES$ (correct)
^MY_DATASETS$
^LICENSE.md$
* License: use_mit_license()
License: MIT + file LICENSE
* DESCRIPTION
* NAMESPACE
https://r-pkgs.org/namespace.html
use_package("data.table")
## > use_package("data.table")
## √ Adding 'data.table' to Imports field in DESCRIPTION
## * Refer to functions with `data.table::fun()`
However, this does not change NAMESPACE! Then who changes it?
Is it devtools::document()?
Generated by roxygen2: do not edit by hand
export()
export(addDerivatives)
export(extractMostInfectedToday)
export(readCovidUofT.csv)
importFrom(lubridate,dmy)
importFrom(stringr,str_replace)
I deleted NAMESPACE so that roxygen can generate it!
Then I manually added the following there. Not sure that's the way to do it!
import(data.table)
import(ggplot2)
import(lubridate)
import(magrittr)
import(IVIM)
It Worked !
If you are using just a few functions from another package, the recommended option is to note the package name in the Imports: field of the DESCRIPTION file and call the function(s) explicitly using ::, e.g., pkg::fun(). Alternatively, though no longer recommended due to its poorer readability, use @importFrom, e.g., @importFrom pkg fun, and call the function(s) without ::.
* Examples and tests
.. In main file in /R folder
#' @examples
Some could be wrapped like this:
#' \dontrun{}
.. In /tests folder
> use_testthat()
√ Setting active project to 'C:/Users/gxd006/Downloads/_CODES/GCCode/r4gc/packages/IVIM'
√ Adding 'testthat' to Suggests field in DESCRIPTION
√ Setting Config/testthat/edition field in DESCRIPTION to '3'
√ Creating 'tests/testthat/'
√ Writing 'tests/testthat.R'
* Call `use_test()` to initialize a basic test file and open it for editing.
Warning messages:
1: In readLines(f, n) :
incomplete final line found on 'C:/Users/gxd006/Downloads/_CODES/GCCode/r4gc/packages/IVIM/DESCRIPTION'
...
incomplete final line found on 'C:/Users/gxd006/Downloads/_CODES/GCCode/r4gc/packages/IVIM/DESCRIPTION'
> use_vignette()
Error in check_vignette_name(name) :
argument "name" is missing, with no default
.. in MY_CODES
Provide as .R, .Rmd, or a Shiny app
* Documentation
devtools::document()
https://kbroman.org/pkg_primer/pages/docs.html
#' For more details see the help vignette:
#' \code{vignette("help", package = "mypkg")}
or
\href{../doc/help.html}{\code{vignette("help", package = "mypkg")}}
* Vignettes
> use_vignette(name="my_vignettes")
√ Adding 'knitr' to Suggests field in DESCRIPTION
√ Setting VignetteBuilder field in DESCRIPTION to 'knitr'
√ Adding 'inst/doc' to '.gitignore'
√ Creating 'vignettes/'
√ Adding '*.html', '*.R' to 'vignettes/.gitignore'
√ Adding 'rmarkdown' to Suggests field in DESCRIPTION
√ Writing 'vignettes/my_vignettes.Rmd'
* Modify 'vignettes/my_vignettes.Rmd'
I later renamed my first vignette to intro.Rmd - manually
https://bookdown.org/yihui/rmarkdown-cookbook/package-vignette.html
Delivering package
- [Easiest way] Put binary in package repo. Then you can install it simply using
install.packages("IVIM", repos = "https://gccode.ssc-spc.gc.ca/r4gc/gc-packages/IVIM/-/blob/master/versions/IVIM_0.0.0.9000.tar.gz")
- this tarball is obtained by running
devtools::build() or devtools::check(), which will place it in the "../yourRPackageProject" directory.
- [CONFIRM THIS!] Put the source on GitHub/GitLab. Then people will be able to install it using
devtools::install_github("https://gccode.ssc-spc.gc.ca/r4gc/gc-packages/IVIM")
Short tutorials
Geo/Spatial coding and visualization with R. Part 1:
Contributed by:
Text Analysis with R. Part 1:
Contributed by:
Dual Coding - Python and R unite !
Contributed by:
Working with ggtables
Contributed by:
Automate common look and feel of your ggplot graphs
Contributed by:
Automated generation of report cards
Contributed by:
Shiny Apps
Source: https://open-canada.github.io/Apps/
These Applications have been built with contributions from data scientists across the Government of Canada, using open source tools and data, many as an outcome of the R4GC community training and socializing.
Appendices
Lunch and Learn series
This is the scheduled agenda for the Lunch and Learn 'Data Science with R' series organized by the R4GC community.
Please see the Lunch and Learn page for details on how to join us for these meetings
12 Nov - 19 Nov 2021:
Text Analysis with R follow-up / Converting codes to Shiny App
8 Oct 2021:
Text Analysis with R. Part 1: identifying near-duplicate documents
1 Oct 2021:
Shiny App to summarize very large, high-dimensional tables (code & app provided)
30 Jul - 17 Sep 2021:
Geo/Spatial coding and visualization with R. (code provided)
16 Jul 2021:
Dual Coding - Python and R unite ! (code provided)
9 Jul 2021:
Exploring ggplots (recording, code provided)
2 Jul 2021:
Parsing GC Tables (code provided)
25 Jun 2021:
Using the Open Government Portal API within R (recording, code on github.com/open-canada)
21 Apr 2021:
Analyzing PSES results using R and Shiny
16 Apr - 15 May 2021:
Building R packages (recording, codes provided)
Français
For the moment, this book is being developed in English only. However, many of the comments and tips shared in our community also come in French.
We hope that at some point we can have this book in both official languages: English and French.
We can even make use of automated tools and artificial intelligence - we are data scientists, aren't we? - to automate the translation of this book's content!
If you are interested in contributing to our efforts to translate this book into the language of your choice, please contact Dmitry Gorodnichy.
knitr::knit_exit()